from IPython.core.display import display, HTML
display(HTML("<style>.container { width:80% !important; }</style>"))
Zoe Wang
2018.09.2
For this project I want to look at rent prices in different areas of New Zealand. At first I collected job type, area and some salary data from Seek by web scraping. However, that data was not very accurate, and I also ran into Seek's anti-scraping measures, so I changed my approach and used the API at https://api.business.govt.nz/services/v1/tenancy-services/market-rent/statistics. I was granted 30 days of access (from 28/08/2018), which produced the rent datasets below; they come from Tenancy Services. I also collected some datasets from https://www.qv.co.nz/property-trends/residential-house-values, which cover house prices in different areas of New Zealand.
I tried to find the most suitable predictive model by integrating and analysing the two datasets, using linear regression and KNN. Before fitting the linear regressions I inspected the correlations, which was unfortunately disappointing: the house price dataset contained no feature with a high correlation coefficient against rent. Using the sample standard deviation column of the rent dataset as the predictor gave a model with an R-squared of 0.5; the next-best feature, the 5+ bedroom indicator, only yielded an R-squared of 0.3. I then fed the same two features into KNN regression, obtaining two KNN models, but after trying different k values I could not get valid predictions.
For the linear regression I also computed a 95% prediction interval, which, assuming approximately normally distributed errors, turns the point prediction into an interval estimate. Because the dataset is incomplete, I do not think the resulting model is very accurate, and I personally think that more comprehensive data should be collected to build a more reliable predictive model.
Rent statistics field descriptions:
nLodged - Number of bonds lodged at some point in the period. Note random rounding is applied to this value.
nClosed - Number of bonds closed at some point in the period. Note random rounding is applied to this value.
nCurr - Total number of bonds active at the end of the period. Note random rounding is applied to this value.
mean - Mean weekly rent of bonds lodged within the period.
lq - Lower Quartile weekly rent; weekly rent of the bond that is at the 25th percentile of bonds lodged in the period
med - Median weekly rent; weekly rent of the bond that is at the 50th percentile of bonds lodged in the period
uq - Upper Quartile weekly rent; weekly rent of the bond that is at the 75th percentile of bonds lodged in the period
sd - Sample Standard Deviation of weekly rent
brr - Mean Bond/Rent Ratio
lmean - Mean of natural logarithm weekly rent. Note that exp(lmean) == Geometric mean is a good estimate of the median as rent is log normally distributed so can be thought of as the "Synthetic median" of market rent consistent with the other synthetic statistics below
lsd - sample standard deviation of natural logarithm weekly rent of bonds lodged within the period
slq - Synthetic Lower Quartile Weekly Rent. This is defined as exp(lmean + qnorm(0.25) * lsd) and is a reasonable estimate of the lower quartile
suq - Synthetic Upper Quartile Weekly Rent. This is defined as exp(lmean + qnorm(0.75) * lsd) and is a reasonable estimate of the upper quartile
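The synthetic quartile formulas above can be sketched in code. This is a minimal illustration with made-up `lmean` and `lsd` values (not from the real dataset), using the standard library's `NormalDist.inv_cdf` in place of R's `qnorm`:

```python
from statistics import NormalDist
from math import exp

# Hypothetical values for illustration only (not from the API)
lmean, lsd = 6.0, 0.25   # mean and sd of log weekly rent

qnorm = NormalDist().inv_cdf          # standard normal quantile function
slq = exp(lmean + qnorm(0.25) * lsd)  # synthetic lower quartile
smed = exp(lmean)                     # "synthetic median" (geometric mean)
suq = exp(lmean + qnorm(0.75) * lsd)  # synthetic upper quartile
print(round(slq, 2), round(smed, 2), round(suq, 2))
```

Because qnorm(0.25) = -qnorm(0.75), the synthetic quartiles are symmetric around the synthetic median on the log scale.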
import pandas as pd
import numpy as np
import matplotlib as mtpl
import matplotlib.pyplot as plt
import seaborn as sns
from pylab import rcParams
rcParams['figure.figsize'] = 15, 10
rcParams['font.size'] = 20
rcParams['figure.dpi'] = 350
rcParams['lines.linewidth'] = 2
rcParams['axes.facecolor'] = 'white'
rcParams['patch.edgecolor'] = 'white'
rcParams['font.family'] = 'StixGeneral'
import os
import sys
import json
import requests,re
import pandas as pd
from nltk import clean_html
from urllib.request import urlopen
from bs4 import BeautifulSoup as bs
!pip install requests requests_oauthlib
import requests
from requests_oauthlib import OAuth1Session
from requests_oauthlib import OAuth1
#Get market rent statistics from the Tenancy Services API
urlrt = "https://api.business.govt.nz/services/v1/tenancy-services/market-rent/statistics?period-ending=2018-06&num-months=12&area-definition=REGC2016&include-aggregates=false"
rent = requests.get(urlrt, headers={'Authorization': 'Bearer 42c167731d734bc78337497f8721ba1d'})
rent
rentinfo = json.loads(rent.content)
rentinfo
rent_info = []
for i in range(len(rentinfo['items'])):
    rent_info.append({
        'Location': rentinfo['items'][i]['area'],
        'Houeing Type': rentinfo['items'][i]['dwell'],
        'Number of Bedrooms': rentinfo['items'][i]['nBedrms'],
        'Mean of rent': rentinfo['items'][i]['mean'],
        'lmean': rentinfo['items'][i]['lmean'],
        'lq': rentinfo['items'][i]['lq'],
        'uq': rentinfo['items'][i]['uq'],
        'sd': rentinfo['items'][i]['sd'],
        'brr': rentinfo['items'][i]['brr'],
        'slq': rentinfo['items'][i]['slq'],
        'suq': rentinfo['items'][i]['suq'],
        'nCurr': rentinfo['items'][i]['nCurr'],
        'lsd': rentinfo['items'][i]['lsd'],
        'nLodged': rentinfo['items'][i]['nLodged']
    })
rent = pd.DataFrame(rent_info)
rent.head()
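As a side note, the row-building loop above can also be expressed with pandas' built-in JSON flattener. A minimal sketch with a hypothetical two-item payload shaped like the API's `items` list:

```python
import pandas as pd

# Hypothetical payload mimicking the API's 'items' list (not real data)
items = [
    {'area': 'Auckland Region', 'dwell': 'House', 'nBedrms': '3', 'mean': 520},
    {'area': 'Otago Region', 'dwell': 'Flat', 'nBedrms': '2', 'mean': 380},
]
df = pd.json_normalize(items)
# keep the same column names as the manual loop above
df = df.rename(columns={'area': 'Location', 'dwell': 'Houeing Type',
                        'nBedrms': 'Number of Bedrooms', 'mean': 'Mean of rent'})
print(df.shape)
```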
#Get House price during 2017.07-2018.05
hsh1 = pd.read_csv('.../datasets/All-2018-08-29 102400324.csv')
hsh2 = pd.read_csv('.../datasets/All-2018-08-29 102506553.csv')
hsh3 = pd.read_csv('.../datasets/All-2018-08-29 102517261.csv')
hsh4 = pd.read_csv('.../datasets/All-2018-08-29 102638289.csv')
hsh5 = pd.read_csv('.../datasets/All-2018-08-29 102648559.csv')
hsh6 = pd.read_csv('.../datasets/All-2018-08-29 102700412.csv')
hsh7 = pd.read_csv('.../datasets/All-2018-08-29 102740339.csv')
hsh8 = pd.read_csv('.../datasets/All-2018-08-29 102751571.csv')
hsh9 = pd.read_csv('.../datasets/All-2018-08-29 102759493.csv')
hsh10 = pd.read_csv('.../datasets/All-2018-08-29 102808165.csv')
hsh11 = pd.read_csv('.../datasets/All-2018-08-29 102818251.csv')
hsh12 = pd.read_csv('.../datasets/All-2018-08-29 102826954.csv')
hsh13 = pd.read_csv('.../datasets/All-2018-08-29 102837189.csv')
hsh14 = pd.read_csv('.../datasets/All-2018-08-29 131439302.csv')
#Merge each pair of monthly house price dataframes
house_df1 = pd.merge(hsh1, hsh2, on='Area', suffixes=('_Jan', '_Feb'))
house_df2 = pd.merge(hsh3, hsh4, on='Area', suffixes=('_Mar', '_Apr'))
house_df3 = pd.merge(hsh5, hsh6, on='Area', suffixes=('_May', '_Jul'))
house_df4 = pd.merge(hsh7, hsh8, on='Area', suffixes=('_Jun', '_Aug'))
house_df5 = pd.merge(hsh9, hsh10, on='Area', suffixes=('_Sep', '_Oct'))
house_df6 = pd.merge(hsh11, hsh12, on='Area', suffixes=('_Nov', '_Dce'))
#Merge month dataframe
house_price1 = pd.merge(house_df1, house_df2, on='Area')
house_price2 = pd.merge(house_df3, house_df4, on='Area')
house_price3 = pd.merge(house_df5, house_df6, on='Area')
house_price4 = pd.merge(house_price1, house_price2, on='Area')
house_price5 = pd.merge(house_price3, hsh14, on='Area')
house_price = pd.merge(house_price4, house_price5, on='Area')
house_price = pd.DataFrame(data=house_price )
house_price.head()
#Remove '$', ',' and '%' characters from the dataframe
for i in house_price:
    house_price[i] = house_price[i].str.lstrip('$').str.rstrip('%').str.replace(',', '')
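A minimal sketch of this cleaning step on toy values, including the numeric conversion that follows later in the notebook:

```python
import pandas as pd

# Hypothetical price strings shaped like the QV columns (not real data)
prices = pd.Series(['$1,045,000', '$452,500', '12.5%'])
cleaned = prices.str.lstrip('$').str.rstrip('%').str.replace(',', '', regex=False)
numbers = pd.to_numeric(cleaned)   # strings become floats once the symbols are gone
print(list(numbers))
```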
#Rename Area to Location
house_price = house_price.rename(columns={'Area':'Location'})
house_price = house_price.rename(columns={'Average value June 2017_x':'Average value June 2017'})
#Get 2017 house price
houe_price_2017 = house_price.drop(['Average value January 2018','Change value_Jan','Average value February 2018', 'Change value_Feb', 'Change value_Feb','Average value March 2018','Average value April 2018','Change value_Apr','Average value May 2018','Change value_May','Average value June 2016','Average value June 2018','Change value_Jun','Average value June 2017_y','Change value_Jul','Average value July 2016','Change value_Aug','Average value August 2016','Change value_Sep','Average value September 2016','Change value_Oct','Average value October 2016','Change value_Nov','Average value November 2016','Average value December 2016','Change value','Change value_Mar','Change value_Dce'], axis=1)
houe_price_2017.head()
In this section I want to merge all the related data and handle the missing values.
rent.dtypes
#Change house price data types from object to float
#house_price[['Average value January 2018']]=pd.DataFrame(house_price[['Average value January 2018']],dtype=np.float)
#house_price[['Average value January 2018','Average value January 2017','Change value_Jan']] = house_price[['Average value January 2018','Average value January 2017','Change value_Jan']].apply(pd.to_numeric)
house_price = house_price.apply(pd.to_numeric, errors='ignore')
houe_price_2017 = houe_price_2017.apply(pd.to_numeric, errors='ignore')
house_price.dtypes
rent.shape
house_price.shape
rent.isnull().sum()
houe_price_2017.isnull().sum()
rent.describe(include=[np.number])
houe_price_2017.describe(include=[np.number])
Looking at the rent dataframe's summary, I observed two columns with a minimum of zero. One is nCurr, the total number of bonds active at the end of the period, which can legitimately be zero (no houses rented). The other is lsd, the sample standard deviation of log weekly rent; a zero there means the sample standard deviation is degenerate, i.e. sd has missing values.
#Fill in the missing standard deviation values in the rent table
#std = sqrt(mean(abs(x - x.mean())**2))
#np.std(dataframe, ddof=1)
rent = rent.fillna(np.std(rent, ddof=1))
#Round to 2 decimal places
rent = rent.round(2)
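Filling missing values with the column's standard deviation is an unusual imputation choice; a more common alternative (shown here only as a sketch on toy data, not a claim about which is correct for this dataset) is per-column median imputation:

```python
import numpy as np
import pandas as pd

# Toy frame with a missing sd value (hypothetical)
df = pd.DataFrame({'sd': [50.0, np.nan, 70.0], 'mean': [400.0, 380.0, 420.0]})
# median imputation per column, then round to 2 decimals as in the notebook
df_filled = df.fillna(df.median(numeric_only=True)).round(2)
print(df_filled['sd'].tolist())
```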
rent['Location'].value_counts()
houe_price_2017['Location'].value_counts()
#Matching location from rent to house price 2017
#NorthLand Region
NorthLand = houe_price_2017.loc[[4,5,6,],:]
#Auckland Region
Auckland = houe_price_2017.loc[[7,10,14,15,20,24],:]
#Waikato Region
Waikato = houe_price_2017.loc[[25,26,27,28,29,30,35,36,37,38,39],:]
#Bay of Plenty Region
Bay_of_Plenty = houe_price_2017.loc[[40,41,42,43,44,45],:]
#Gisborne Region
Gisborne = houe_price_2017.loc[[46],:]
#Hawke's Bay Region
Hawkes_Bay = houe_price_2017.loc[[47,48,49,50],:]
#Taranaki Region
Taranaki = houe_price_2017.loc[[51,52,53],:]
#Manawatu-Wanganui Region
Manawatu_Wanganui = houe_price_2017.loc[[54,55,56,57,58,59,60],:]
#Wellington Region
Willionton = houe_price_2017.loc[[61,62,63,64,65,70,71,72],:]
#Marlborough Region
Marlborough = houe_price_2017.loc[[75],:]
#West Coast Region
West_Coast = houe_price_2017.loc[[77,78,79],:]
#Tasman Region
Tasman = houe_price_2017.loc[[73],:]
#Canterbury Region
Canterbury = houe_price_2017.loc[[76,80,81,82,88,89,90,91,92],:]
#Otago Region
Otago = houe_price_2017.loc[[93,94,95,96,101],:]
#Nelson Region
Nelson = houe_price_2017.loc[[74],:]
#Southland Region
Southland = houe_price_2017.loc[[102,103,104],:]
#Map all districts belonging to the Northland Region
match_data_northland = {'Far North District':'Northland Region','Whangarei District':'Northland Region','Kaipara District':'Northland Region'}
NorthLand['Location'].replace(match_data_northland, inplace=True)
#Map all districts belonging to the Auckland Region
match_data_akl = {'Rodney District':'Auckland Region','North Shore City':'Auckland Region','Waitakere City':'Auckland Region','':'Auckland Region','Manukau City':'Auckland Region','Papakura District':'Auckland Region'}
Auckland['Location'].replace(match_data_akl, inplace=True)
#Map all districts belonging to the Waikato Region
match_data_Waikato = {'Franklin District':'Waikato Region','Thames-Coromandel District':'Waikato Region','Hauraki District':'Waikato Region','Waikato District':'Waikato Region','Matamata-Piako District':'Waikato Region','Hamilton City':'Waikato Region','Waipa District':'Waikato Region','Otorohanga District':'Waikato Region','South Waikato District':'Waikato Region','Waitomo District':'Waikato Region','Taupo District':'Waikato Region'}
Waikato['Location'].replace(match_data_Waikato, inplace=True)
#Map all districts belonging to the Bay of Plenty Region
match_data_bay = {'Western Bay of Plenty District':'Bay of Plenty Region','Tauranga City':'Bay of Plenty Region','Rotorua District':'Bay of Plenty Region','Whakatane District':'Bay of Plenty Region','Kawerau District':'Bay of Plenty Region','Opotiki District':'Bay of Plenty Region'}
Bay_of_Plenty['Location'].replace(match_data_bay, inplace=True)
#Map all districts belonging to the Hawke's Bay Region
match_data_hawkes = {'Wairoa District':'Hawkes Bay Region','Hastings District':"Hawkes Bay Region",'Napier City':"Hawkes Bay Region",'Central Hawkes Bay District':"Hawkes Bay Region"}
Hawkes_Bay['Location'].replace(match_data_hawkes, inplace=True)
#Map all districts belonging to the Taranaki Region
match_data_Taranaki = {'New Plymouth District':'Taranaki Region','Stratford District':'Taranaki Region','South Taranaki District':'Taranaki Region'}
Taranaki['Location'].replace(match_data_Taranaki, inplace=True)
#Map all districts belonging to the Manawatu-Wanganui Region
match_data_Manawatu_Wanganui = {'Ruapehu District':'Manawatu-Wanganui Region','Whanganui District':'Manawatu-Wanganui Region', 'Manawatu District':'Manawatu-Wanganui Region','Rangitikei District':'Manawatu-Wanganui Region', 'Palmerston North City':'Manawatu-Wanganui Region', 'Tararua District':'Manawatu-Wanganui Region','Horowhenua District':'Manawatu-Wanganui Region'}
Manawatu_Wanganui['Location'].replace(match_data_Manawatu_Wanganui, inplace=True)
#Map all districts belonging to the Wellington Region
match_data_Willionton = {'Kapiti Coast District':'Wellington Region','Porirua City':'Wellington Region','Upper Hutt City':'Wellington Region','Lower Hutt City':'Wellington Region','Wellington City':'Wellington Region','Masterton District':'Wellington Region', 'Carterton District':'Wellington Region','South Wairarapa District':'Wellington Region'}
Willionton['Location'].replace(match_data_Willionton, inplace=True)
#Map all districts belonging to the West Coast Region
match_data_West_Coast = {'Buller District':'West Coast Region', 'Grey District':'West Coast Region','Westland District':'West Coast Region'}
West_Coast['Location'].replace(match_data_West_Coast, inplace=True)
#Map all districts belonging to the Canterbury Region
match_data_Canterbury = {'Kaikoura District':'Canterbury Region','Hurunui District':'Canterbury Region','Waimakariri District':'Canterbury Region', 'Christchurch City':'Canterbury Region','Selwyn District':'Canterbury Region','Ashburton District':'Canterbury Region', 'Timaru District':'Canterbury Region','MacKenzie District':'Canterbury Region'}
Canterbury['Location'].replace(match_data_Canterbury, inplace=True)
#Map all districts belonging to the Otago Region
match_data_Otago = {'Waitaki District':'Otago Region','Central Otago District':'Otago Region','Queenstown-Lakes District':'Otago Region', 'Dunedin City':'Otago Region','Clutha District':'Otago Region'}
Otago['Location'].replace(match_data_Otago, inplace=True)
#Map all districts belonging to the Southland Region
match_data_Southland = {'Southland District':'Southland Region','Gore District':'Southland Region','Invercargill City':'Southland Region'}
Southland['Location'].replace(match_data_Southland, inplace=True)
#Map all districts belonging to the Gisborne Region
match_data_Gisborne = {'Gisborne District':'Gisborne'}
Gisborne['Location'].replace(match_data_Gisborne, inplace=True)
#Map all districts belonging to the Marlborough Region
match_data_Marlborough = {'Marlborough District':'Marlborough Region'}
Marlborough['Location'].replace(match_data_Marlborough, inplace=True)
#Map all districts belonging to the Tasman Region
match_data_Tasman = {'Tasman District':'Tasman Region'}
Tasman['Location'].replace(match_data_Tasman, inplace=True)
#Map all districts belonging to the Nelson Region
match_data_Nelson = {'Nelson City':'Nelson Region'}
Nelson['Location'].replace(match_data_Nelson, inplace=True)
#Concatenate all the regions into one house price dataframe
location = [NorthLand, Auckland, Waikato, Bay_of_Plenty, Gisborne, Hawkes_Bay, Taranaki, Manawatu_Wanganui, Willionton, Marlborough, West_Coast, Tasman, Canterbury, Otago, Nelson, Southland]
house_price_2017_new = pd.concat(location)
#Add a total across the monthly value columns for each area
sum_house = house_price_2017_new.sum(axis=1)
house_price_2017_new['Total'] = sum_house
#Merge rent and house price to one dataframe
house_rent = pd.merge(rent,house_price_2017_new, on='Location')
house_rent.head()
#Check missing values
house_rent.isnull().sum()
dummy = pd.get_dummies(house_rent[['Houeing Type', 'Location', 'Number of Bedrooms']])
house_rent_text = pd.concat([house_rent, dummy], axis=1)
house_rent_text
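For reference, a minimal sketch of what `get_dummies` produces, using a small hypothetical column with the same name as the one in the notebook:

```python
import pandas as pd

# Hypothetical categorical column matching the notebook's 'Houeing Type'
df = pd.DataFrame({'Houeing Type': ['House', 'Flat', 'House']})
dummies = pd.get_dummies(df['Houeing Type'])
print(list(dummies.columns))   # one 0/1 indicator column per category
```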
The histogram of housing type shows that House and Flat are the most common among rentals, while Room is the least common.
rent.groupby('Houeing Type')['Houeing Type'].count()
sns.factorplot('Houeing Type', data=rent, kind='count', aspect=3)
plt.title('Histogram of House type analysis')
Across most bedroom counts the housing types are fairly evenly distributed, but among five-or-more-bedroom listings Apartment is the least common type, which may reflect how such buildings are constructed. Room and Apartment are also the rarest in the records where the number of bedrooms is unknown.
rent.groupby(['Houeing Type', 'Number of Bedrooms'])['Houeing Type'].count()
g = sns.factorplot('Number of Bedrooms', data=rent, hue='Houeing Type', kind='count', aspect=1.75)
g.set_xlabels('Different House bedroom')
plt.title('Relation of House type and Bedroom number')
In the plot of location counts we can see that rental listings are spread fairly evenly across most regions, except for the West Coast Region. The density of rental counts is mainly concentrated between 20 and 26, which covers the Auckland, Otago, Wellington, NA (unknown), Canterbury, Waikato, Bay of Plenty, Manawatu-Wanganui, Taranaki and Northland regions.
area = rent['Location'].value_counts()
Area_dist = sns.distplot(area)
Area_dist.set_title("Distribution of Rent Location")
area
Rents are mainly concentrated in roughly the 210 to 560 range, with a very small number of higher prices; the most frequent rent is around 400.
house_rent['Mean of rent'].hist(bins=50)
plt.xlabel('Mean of rent')
plt.title('Histogram of Mean rent')
In the boxplot we can see that the highest rents appear in the Auckland region, followed by the unknown (NA) area and the Wellington region. The lowest rents across New Zealand sit between 180 and 210, and the gap between the regions' interquartile ranges is quite pronounced.
rent[['Mean of rent', 'Location']].boxplot(column='Mean of rent', by='Location', figsize=(20,8))
plt.xticks(rotation=45)
The facet plot summarises the relationship between bonds lodged in 2017 and mean rent, broken down by number of bedrooms. The 5+ bedroom category has the widest spread of rents but the least lodgement activity, while one- and two-bedroom rents are densely clustered. Three-bedroom lodgements are the most active and are dominated by the House type.
#Relation between number of bonds lodged at some point in the 2017, bedrooms type and mean of rent
#ax.tick_params(labelsize=8)
g = sns.FacetGrid(house_rent, col="Number of Bedrooms", hue='Houeing Type', aspect=4, col_wrap=2)
g.map(plt.scatter, "Mean of rent", "nLodged", alpha=.7)
g.add_legend();
The boxplot of total house value by location is very interesting: the difference between the mean and the lowest values is large, which shows how uneven prices are across New Zealand's regions. The Auckland Region has the highest range, while the Manawatu-Wanganui Region has the lowest prices. In a few areas only an average is shown because the data is incomplete.
house_rent[['Total', 'Location']].boxplot(column='Total', by='Location', figsize=(28,10))
plt.xticks(rotation=45)
import statsmodels.formula.api as smf
import statsmodels.api as sm
from statsmodels.sandbox.regression.predstd import wls_prediction_std
house_rent_text.pivot_table(values=['rent_m',
                                    'Number of Bedrooms',
                                    'brr',
                                    'lmean',
                                    'lq',
                                    'lsd',
                                    'nCurr',
                                    'nLodged',
                                    'sd',
                                    'slq',
                                    'suq',
                                    'uq',
                                    'Average value January 2017',
                                    'Average value February 2017',
                                    'Average value March 2017',
                                    'Average value April 2017',
                                    'Average value May 2017',
                                    'Average value June 2017',
                                    'Average value July 2017',
                                    'Average value August 2017',
                                    'Average value September 2017',
                                    'Average value October 2017',
                                    'Average value November 2017',
                                    'Average value December 2017',
                                    'Total'],
                            index=['Location'], aggfunc=[np.median, np.max, np.min])
#Analyse the correlation coefficients
house_rent_text.corr()
#Rename columns containing spaces or symbols so they can be used in formulas
house_rent_text = house_rent_text.rename(columns={'Number of Bedrooms_5+': 'Nb5','Mean of rent': 'rent_m','Houeing Type':'hstyp' })
#generate the x-axis values that are in range for the Sample Standard Deviation of weekly rent values
x = pd.DataFrame({'sd': np.linspace(house_rent_text.sd.min(), house_rent_text.sd.max(), len(house_rent_text.sd))})
#generate the model which uses the Sample Standard Deviation of weekly rent to predict the mean rent
mod = smf.ols(formula='rent_m ~ 1 +sd', data=house_rent_text.dropna()).fit()
#plot the actual data
plt.scatter(house_rent_text.sd, house_rent_text.rent_m, s=20, alpha=0.6)
plt.xlabel('Sample Standard Deviation of weekly rent'); plt.ylabel('Mean of rent')
#render the regression line by predicting the ys using the generated model from above
plt.plot(x.sd, mod.predict(x), 'r', label='Linear $R^2$=%.2f' % mod.rsquared, alpha=0.9)
#give the figure a meaningful legend
plt.legend(loc='upper left', framealpha=0.5, prop={'size':'small'})
plt.title("Predicting rent fee results based on Sample Standard Deviation of weekly rent", fontsize=30)
mod.summary()
#generate the model
mod = smf.ols(formula='rent_m ~ 1 +sd', data=house_rent_text.dropna()).fit()
#extract the parameters for the confidence window
x_pred = np.linspace(house_rent_text.sd.min(), house_rent_text.sd.max(), len(house_rent_text.sd))
x_pred2 = sm.add_constant(x_pred)
#confidence = 95% (alpha=0.05)
sdev, lower, upper = wls_prediction_std(mod, exog=x_pred2, alpha=0.05)
#plot points and confidence window
plt.scatter(house_rent_text.sd, house_rent_text.rent_m, s=10, alpha=0.9)
plt.fill_between(x_pred, lower, upper, color='#888888', alpha=0.2)
#plot the regression line
plt.plot(house_rent_text.sd.dropna(), mod.predict(house_rent_text[['sd']] ), 'b-', label='Linear n=1 $R^2$=%.2f' % mod.rsquared, alpha=0.9)
plt.xlabel('Sample Standard Deviation of weekly rent')
plt.ylabel('Mean Rent')
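The prediction band that `wls_prediction_std` returns can also be computed by hand, which makes the 95% interval easier to interpret. A sketch on toy data (hypothetical, not the rent dataset), using the normal approximation 1.96 instead of the exact t quantile:

```python
import numpy as np

# Toy data (hypothetical): y = 3 + 2x plus unit-variance noise
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 3 + 2 * x + rng.normal(0, 1, 50)

# Ordinary least squares fit
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
n, p = X.shape
s2 = resid @ resid / (n - p)            # residual variance estimate

# ~95% prediction interval half-width at a new point x0 = 5
x0 = np.array([1.0, 5.0])
var_pred = s2 * (1 + x0 @ np.linalg.inv(X.T @ X) @ x0)
half = 1.96 * np.sqrt(var_pred)
y0 = x0 @ beta
print(y0 - half, y0 + half)
```

The `(1 + ...)` term is what distinguishes a prediction interval (for a new observation) from a confidence interval for the mean response.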
#generate the x-axis values that are in range for the 5+ Bedroom values
x = pd.DataFrame({'Nb5': np.linspace(house_rent_text.Nb5.min(), house_rent_text.Nb5.max(), len(house_rent_text.Nb5))})
#generate the model which uses the 5+ bedroom indicator to predict the mean rent
mod = smf.ols(formula='rent_m ~ 1 +Nb5', data=house_rent_text.dropna()).fit()
#plot the actual data
plt.scatter(house_rent_text.Nb5, house_rent_text.rent_m, s=20, alpha=0.6)
plt.xlabel('5+ Bedroom'); plt.ylabel('Mean of rent')
#render the regression line by predicting the ys using the generated model from above
plt.plot(x.Nb5, mod.predict(x), 'r', label='Linear $R^2$=%.2f' % mod.rsquared, alpha=0.9)
#give the figure a meaningful legend
plt.legend(loc='upper left', framealpha=0.5, prop={'size':'small'})
plt.title("Predicting rent fee results based on Bedroom number", fontsize=30)
mod.summary()
#generate the model
mod = smf.ols(formula='rent_m ~ 1 +Nb5', data=house_rent_text.dropna()).fit()
#extract the parameters for the confidence window
x_pred = np.linspace(house_rent_text.Nb5.min(), house_rent_text.Nb5.max(), len(house_rent_text.Nb5))
x_pred2 = sm.add_constant(x_pred)
#confidence = 95% (alpha=0.05)
sdev, lower, upper = wls_prediction_std(mod, exog=x_pred2, alpha=0.05)
#plot points and confidence window
plt.scatter(house_rent_text.Nb5, house_rent_text.rent_m, s=10, alpha=0.9)
plt.fill_between(x_pred, lower, upper, color='#888888', alpha=0.2)
#plot the regression line
plt.plot(house_rent_text.Nb5.dropna(), mod.predict(house_rent_text[['Nb5']] ), 'b-', label='Linear n=1 $R^2$=%.2f' % mod.rsquared, alpha=0.9)
plt.xlabel('5+ Bedroom House Type')
plt.ylabel('Mean Rent')
from sklearn import neighbors
house_rent_text
Following up on the linear regression R-squared values, I used the sample standard deviation of weekly rent for a KNN analysis, trying k values of 3, 5, 10 and 50; I think k = 5 is the most suitable for prediction.
#KNN prediction
X = house_rent_text.sd.values
X = np.reshape(X, (len(house_rent_text.sd), 1))
y = house_rent_text.rent_m.values
y = np.reshape(y, (len(house_rent_text.rent_m), 1))
# Fit regression model
x = np.linspace(0, 60, 400)[:, np.newaxis]
n_neighbors = 5
for i, weights in enumerate(['uniform', 'distance']):
    knn = neighbors.KNeighborsRegressor(n_neighbors, weights=weights)
    y_hat = knn.fit(X, y).predict(x)
    plt.subplot(2, 1, i + 1)
    plt.scatter(X, y, c='k', label='data')
    plt.plot(x, y_hat, c='b', label='prediction')
    plt.axis('tight')
    plt.xlabel('Sample Standard Deviation of weekly rent')
    plt.legend(loc='upper left')
    plt.title("KNeighborsRegressor (k = %i, weights = '%s')" % (n_neighbors, weights))
plt.subplots_adjust(hspace=0.5)
plt.show()
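Rather than judging candidate k values by eye, leave-one-out error can rank them. A pure-NumPy sketch on toy 1-D data (hypothetical, standing in for the sd-to-rent relationship, not the real dataset):

```python
import numpy as np

def knn_predict(x_train, y_train, x0, k):
    """Uniform-weight KNN regression at a single query point."""
    idx = np.argsort(np.abs(x_train - x0))[:k]
    return y_train[idx].mean()

# Toy 1-D data with a linear trend plus noise
rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0, 60, 80))
y = 200 + 5 * x + rng.normal(0, 20, 80)

# Leave-one-out mean squared error for each candidate k
mses = {}
for k in (3, 5, 10, 50):
    errs = [(y[i] - knn_predict(np.delete(x, i), np.delete(y, i), x[i], k)) ** 2
            for i in range(len(x))]
    mses[k] = float(np.mean(errs))
print(min(mses, key=mses.get))  # k with the lowest leave-one-out error
```

Very large k values average over most of the dataset and so underfit badly near the edges, which is one reason the k = 50 fit above looks flat.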
I also tried different features for the KNN analysis, hoping to get some different insights.
#KNN prediction
X = house_rent_text.Nb5.values
X = np.reshape(X, (len(house_rent_text.Nb5), 1))
y = house_rent_text.rent_m.values
y = np.reshape(y, (len(house_rent_text.rent_m), 1))
# Fit regression model
x = np.linspace(0, 10, 100)[:, np.newaxis]
n_neighbors = 2
for i, weights in enumerate(['uniform', 'distance']):
    knn = neighbors.KNeighborsRegressor(n_neighbors, weights=weights)
    y_hat = knn.fit(X, y).predict(x)
    plt.subplot(2, 1, i + 1)
    plt.scatter(X, y, c='k', label='data')
    plt.plot(x, y_hat, c='b', label='prediction')
    plt.axis('tight')
    plt.xlabel('5+ Bedroom indicator')
    plt.legend(loc='upper right')
    plt.title("KNeighborsRegressor (k = %i, weights = '%s')" % (n_neighbors, weights))
plt.subplots_adjust(hspace=0.5)
plt.show()